Utilizing machine learning techniques to generate value from a pulp sensibility data set: we use supervised learning algorithms to solve the binary classification problem of predicting whether a supplement is needed, and we also try to identify the root cause of the problem.
We collected data on 128 patients with 20 features, including the binary target feature indicating whether a patient needs a supplement.
A data profile report was generated to explore the contents of the collected data set.
Observations:
Most machine learning algorithms can only work with numeric data so it was necessary to encode the categorical features into numeric features. As all of the categorical features in the data set are nominal, i.e., their classes have no meaningful order, I used one-hot encoding to convert the categorical features into indicator variables, also known as dummy variables. One-hot encoding creates a new dummy variable for each class in a categorical feature, where a value of 1 for a dummy variable indicates the presence of the class and a value of 0 indicates the absence of the class.
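A minimal sketch of one-hot encoding with pandas; the column name and class values below are illustrative, not taken from the actual data set:

```python
import pandas as pd

# Toy frame with one nominal categorical feature (hypothetical names).
df = pd.DataFrame({"Tooth_Type": ["Molar", "Incisor", "Molar", "Canine"]})

# One dummy column per class; 1 marks presence of the class, 0 absence.
encoded = pd.get_dummies(df, columns=["Tooth_Type"], dtype=int)
print(encoded)
```

Each original categorical column is replaced by one indicator column per class, so no artificial ordering is imposed on the classes.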
Feature selection is a method of filtering out the important features, since not all features in the data set are equally important: some have no effect on the output and can be dropped. Our aim is to reduce the data before feeding it to the training model.
We select the top 15 features, ranking them with Pearson's correlation, the chi-square test, and LightGBM feature importance.
We check the absolute value of the Pearson’s correlation between the target and numerical features in our dataset. We keep the top n features based on this criterion.
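A sketch of the correlation filter on synthetic data (the feature names and target construction here are placeholders, not the real data set):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in: 100 rows, 5 numeric features plus a binary target.
X = pd.DataFrame(rng.normal(size=(100, 5)), columns=[f"f{i}" for i in range(5)])
y = (X["f0"] + 0.5 * X["f1"] + rng.normal(scale=0.1, size=100) > 0).astype(int)

# Absolute Pearson correlation of each feature with the target,
# sorted descending; keep the top-n features.
n = 2
corr = X.corrwith(y).abs().sort_values(ascending=False)
top_features = corr.head(n).index.tolist()
print(top_features)
```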
We calculate the chi-square statistic between the target and each feature and select the variables with the highest chi-square values. Note that the chi-square test requires non-negative feature values, such as counts or min-max scaled data.
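A minimal sketch using scikit-learn's `SelectKBest` with the `chi2` score function, on synthetic count-like data (the data and k value are assumptions for illustration):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(1)
# chi2 requires non-negative values, e.g. counts or min-max scaled features.
X = rng.integers(0, 10, size=(200, 6)).astype(float)
y = (X[:, 0] > 4).astype(int)  # target driven by the first column

# Keep the 3 features with the largest chi-square statistics.
selector = SelectKBest(score_func=chi2, k=3).fit(X, y)
selected = selector.get_support(indices=True)
print(selected)
```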
We can also use a random forest to select features based on feature importance. In a random forest, the final feature importance is the average of the feature importances across all decision trees.
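A sketch of ranking features by random forest importance; the synthetic data below stands in for the real data set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (placeholder for the real data set).
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# feature_importances_ averages the impurity-based importance over all trees.
importances = rf.feature_importances_
ranking = np.argsort(importances)[::-1]
print(ranking[:3])  # indices of the three most important features
```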
We could also have used a LightGBM or XGBoost model, as long as it exposes a `feature_importances_` attribute.
GridSearch cross-validation for the logistic regression model is performed below.
GridSearch cross-validation for the KNN model is performed below.
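A combined sketch of both searches; the parameter grids, scoring metric, and fold count below are illustrative assumptions, not the project's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the patient data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical grids; the real search spaces depend on the project.
searches = {
    "logreg": GridSearchCV(LogisticRegression(max_iter=1000),
                           {"C": [0.01, 0.1, 1, 10]}, scoring="f1", cv=5),
    "knn": GridSearchCV(KNeighborsClassifier(),
                        {"n_neighbors": [3, 5, 7, 9]}, scoring="f1", cv=5),
}
for name, gs in searches.items():
    gs.fit(X, y)
    print(name, gs.best_params_, round(gs.best_score_, 3))
```

`best_params_` holds the winning hyperparameters and `best_score_` the mean cross-validated F1 of that configuration.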
Having trained and cross-validated the models, I then used the models to make predictions on the test set. I evaluated the performance of the models on the test set using the same F1 and accuracy metrics used to evaluate the models during cross-validation. The performance of the models as indicated by these metrics is displayed below.
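A sketch of the held-out evaluation step; the model and synthetic data are placeholders for the trained models and real test set:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)

# Same metrics used during cross-validation, now on the held-out test set.
print("F1:", round(f1_score(y_test, pred), 3))
print("Accuracy:", round(accuracy_score(y_test, pred), 3))
```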
To objectively determine the degree of bias and variance exhibited by the models, I used the guidelines presented below.
Bias: a model exhibits high bias (underfitting) when its training-set score falls well short of the desired performance.
Variance: a model exhibits high variance (overfitting) when there is a large gap between its training score and its cross-validation or test score.